A Detailed Analysis of Contemporary ARM and x86 Architectures
RISC vs. CISC wars raged in the 1980s, when chip area and processor design complexity were the primary constraints and desktops and servers exclusively dominated the computing landscape. Today, energy and power are the primary design constraints and the computing landscape is significantly different: growth in tablets and smartphones running ARM (a RISC ISA) is surpassing that of desktops and laptops running x86 (a CISC ISA). Further, the traditionally low-power ARM ISA is entering the high-performance server market, while the traditionally high-performance x86 ISA is entering the mobile low-power device market. Thus, the question of whether the ISA plays an intrinsic role in performance or energy efficiency is becoming important, and we seek to answer this question through a detailed measurement-based study on real hardware running real applications. We analyze measurements on the ARM Cortex-A8 and Cortex-A9 and the Intel Atom and Sandybridge i7 microprocessors over workloads spanning mobile, desktop, and server computing. Our methodical investigation demonstrates the role of the ISA in modern microprocessors' performance and energy efficiency. We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy-efficient in one ISA class or the other. Whether the ISA is RISC or CISC seems irrelevant.
Constraint Centric Scheduling Guide
The advent of architectures with software-exposed resources (spatial architectures) has created a demand for universally applicable scheduling techniques. This paper describes our generalized spatial scheduling framework, formulated with Integer Linear Programming (ILP), and accomplishes two goals. First, using the "Simple" architecture, it illustrates how to use our open-source tool to create a customized scheduler and covers problem formulation with ILP and GAMS. Second, it summarizes results from applying the framework to three real architectures (TRIPS, DySER, PLUG), demonstrating the technique's practicality and competitiveness with existing schedulers.
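As a rough illustration of the kind of constraint problem involved (a toy brute-force sketch in Python, not the paper's actual ILP/GAMS formulation; the node and slot names are hypothetical), spatial scheduling amounts to placing dataflow nodes onto hardware slots, subject to the constraint that each slot holds one operation, while minimizing routing cost:

```python
from itertools import permutations

# Toy spatial-scheduling instance (illustrative only; not the paper's model).
# Map each computation node onto a distinct slot of a 2x2 grid of functional
# units, minimizing total Manhattan routing distance over dataflow edges --
# the same objective/constraint shape an ILP formulation would encode.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]  # dataflow edges
slots = [(x, y) for x in range(2) for y in range(2)]      # hardware grid

def cost(placement):
    """Total Manhattan distance routed over all dataflow edges."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1])
               for u, v in edges)

# Exhaustive search stands in for the ILP solver on this tiny instance.
best = min((dict(zip(nodes, perm)) for perm in permutations(slots)), key=cost)
print(cost(best))  # 4: each of the four edges routes over one grid hop
```

An ILP solver reaches the same optimum by encoding the placement as 0/1 assignment variables with one-node-per-slot constraints; the exhaustive search here is only viable because the instance is tiny.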
Polymorphous architectures: a unified approach for extracting concurrency of different granularities
Processor architects today are faced with two daunting challenges: emerging applications with heterogeneous computation needs, and technology limitations of power, wire delay, and process variation. Designing multiple application-specific processors or specialized architectures introduces design complexity, creates a software programmability problem, and reduces economies of scale. There is a pressing need for design methodologies that can provide support for heterogeneous applications, combat processor complexity, and achieve economies of scale. In this dissertation, we introduce the notion of architectural polymorphism to build such scalable processors that support heterogeneous computation by supporting different granularities of parallelism. Polymorphism configures coarse-grained microarchitecture blocks to provide an adaptive and flexible processor substrate. Technology scalability is achieved by designing an architecture using scalable and modular microarchitecture blocks.
We use the dataflow graph as the unifying abstraction layer across three granularities of parallelism: instruction-level, thread-level, and data-level. To first order, this granularity of parallelism is the main difference between different classes of applications. All programs are expressed in terms of dataflow graphs and directly mapped to the hardware, appropriately partitioned as required by the granularity of parallelism. We introduce Explicit Data Graph Execution (EDGE) ISAs, a class of ISAs that serve as an architectural solution for efficiently expressing parallelism and building technology-scalable architectures.
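To illustrate the dataflow execution model at a very high level (a hypothetical Python sketch, not the EDGE ISA or the TRIPS hardware), each node in the graph fires as soon as all of its operands have arrived, with no centralized program counter:

```python
import operator

# Minimal dataflow-firing sketch (illustrative only; not EDGE/TRIPS).
# node -> (operation, list of input names); "lit:N" denotes a literal operand.
graph = {
    "t1": (operator.add, ["lit:2", "lit:3"]),  # t1 = 2 + 3
    "t2": (operator.mul, ["t1", "lit:4"]),     # t2 = t1 * 4
    "t3": (operator.sub, ["t2", "t1"]),        # t3 = t2 - t1
}

def run(graph):
    done = {}
    pending = dict(graph)
    ready = lambda ins: all(i.startswith("lit:") or i in done for i in ins)
    while pending:
        # fire every node whose operands are all available
        for name in [n for n, (_, ins) in pending.items() if ready(ins)]:
            op, ins = pending.pop(name)
            args = [int(i[4:]) if i.startswith("lit:") else done[i]
                    for i in ins]
            done[name] = op(*args)
    return done

print(run(graph)["t3"])  # 15: (2+3)*4 - (2+3)
```

The point of the abstraction is that the same graph representation works whether the nodes are individual instructions (ILP), coarse threads (TLP), or data-parallel lanes (DLP).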
We developed the TRIPS architecture, which implements an EDGE ISA using a heavily partitioned and distributed microarchitecture to achieve technology scalability. The two most significant features of the TRIPS microarchitecture are its heavily partitioned and modular design, and its use of microarchitecture networks for communication across modules. We have also built a prototype TRIPS chip in 130nm ASIC technology, composed of two processor cores and a distributed 1MB Non-Uniform Cache Access (NUCA) on-chip memory system.
Our performance results show that the TRIPS microarchitecture, which provides a 16-issue machine with a 1024-entry instruction window, can sustain good instruction-level parallelism. On a set of hand-optimized kernels, IPCs in the range of 4 to 6 are seen, and on a set of benchmarks with ample data-level parallelism (DLP), compiler-generated code produces IPCs in the range of 1 to 4. On the EEMBC and SPEC CPU2000 benchmarks we see IPCs in the range of 0.5 to 2.3. Compared to the Alpha 21264, a high-performance architecture tuned for ILP, TRIPS is up to 3.4 times better on the hand-optimized kernels. However, compiler-generated binaries for the DLP, EEMBC, and SPEC CPU2000 benchmarks perform worse on TRIPS than on the Alpha 21264. With more aggressive compiler optimization, we expect the performance of the compiler-produced binaries to improve.
The polymorphous mechanisms proposed in this dissertation are effective at exploiting thread-level parallelism and data-level parallelism. When executing four threads on a single processor, significantly high levels of processor utilization are seen; IPCs are in the range of 0.7 to 3.9 for an application mix consisting of EEMBC and SPEC CPU2000 workloads. When executing programs with DLP, the polymorphous mechanisms we propose provide harmonic mean speedups of 2.1X across a set of DLP workloads, compared to an execution model extracting only ILP. Compared to specialized architectures, these mechanisms provide competitive performance using a single execution substrate.
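For context on how an aggregate figure like the 2.1X speedup above is formed, a harmonic mean of per-workload speedups can be computed as follows (the speedup values here are hypothetical, not the dissertation's measured data):

```python
# Harmonic-mean speedup over a workload set (hypothetical per-benchmark
# numbers, not the dissertation's measurements). The harmonic mean is the
# appropriate aggregate for speedups because it weights by execution time
# rather than letting one large speedup dominate.
speedups = [1.5, 2.0, 3.0, 4.0]  # hypothetical DLP-workload speedups

hmean = len(speedups) / sum(1.0 / s for s in speedups)
print(round(hmean, 3))  # 2.286 -- well below the arithmetic mean of 2.625
```

Note that the arithmetic mean of the same values would overstate the overall benefit, which is why speedup summaries conventionally use the harmonic mean.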
Mechanisms for Parallelism Specialization for the DySER Architecture
Specialization is a promising direction for improving processor energy efficiency. With functionality specialization, hardware is designed for application-specific units of computation. With parallelism specialization, hardware is designed to exploit abundant data-level parallelism. The hardware for these specialization approaches has similarities, including many functional units and the elimination of per-instruction overheads. Even so, previous architectures have focused on only one form of specialization. Our goal is to develop mechanisms that unify these two approaches into a single architecture. We develop the DySER architecture to support both, by Dynamically Specializing Execution Resources to match program regions. By dynamically specializing frequently executing regions, and applying a set of judiciously chosen parallelism mechanisms--namely region growing, vectorized communication, and region virtualization--we show that DySER provides efficient functionality and parallelism specialization. It outperforms an OOO CPU, SSE acceleration, and GPU acceleration by up to 4.1x, 4.7x, and 4x respectively, while consuming 9%, 86%, and 8% less energy. Our full-system FPGA prototype of DySER integrated into OpenSPARC demonstrates that an implementation is practical.
- …